$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\Norm}{\mathcal{N}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}\;} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}\;}$
Remember when we were discussing the complexity of models? The simplest model was a constant. A simple way to predict rainfall is to ignore all measurements and just predict the average rainfall. A linear model of the measurements may or may not do better. The degree to which it does do better can be expressed as the relative sum of squared errors (RSE):
$$RSE = \frac{\sum_{i=1}^N (\tv_i - \xv_i^T \wv)^2}{\sum_{i=1}^N (\tv_i - \bar{\Tv})^2}$$If RSE is 1, then your linear model is no better than using the mean. The closer RSE is to 0, the better your linear model is.
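Before working with real data, here is a minimal sketch of what this formula computes, using made-up toy numbers (not rainfall or mpg data):

import numpy as np

# Toy targets and two sets of predictions (made-up numbers, just for illustration)
T = np.array([3.0, 5.0, 4.0, 6.0])
predictMean = np.full(T.shape, T.mean())         # constant model: always predict the mean
predictLinear = np.array([3.2, 4.8, 4.1, 5.9])   # hypothetical linear-model predictions

def rse(T, predict):
    return np.sum((T - predict)**2) / np.sum((T - T.mean())**2)

print(rse(T, predictMean))     # 1.0, no better than the mean
print(rse(T, predictLinear))   # about 0.02, much better than the mean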
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
np.any?
In [3]:
def makeMPGData(filename='auto-mpg.data'):
    def missingIsNan(s):
        return np.nan if s == b'?' else float(s)
    data = np.loadtxt(filename, usecols=range(8), converters={3: missingIsNan})
    print("Read", data.shape[0], "rows and", data.shape[1], "columns from", filename)
    goodRowsMask = np.isnan(data).sum(axis=1) == 0
    data = data[goodRowsMask, :]
    print("After removing rows containing question marks, data has", data.shape[0], "rows and", data.shape[1], "columns.")
    X = data[:, 1:]
    T = data[:, 0:1]
    Xnames = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']
    Tname = 'mpg'
    return X, T, Xnames, Tname
In [4]:
X,T,Xnames,Tname = makeMPGData()
In [7]:
means = X.mean(0)
stds = X.std(0)
nRows = X.shape[0]
Xs1 = np.insert((X-means)/stds, 0, 1, axis=1) # insert column of ones in new 0th column
# Xs1 = np.hstack(( np.ones((nRows,1)), (X-means)/stds ))
w = np.linalg.lstsq(Xs1, T)[0]
w
Out[7]:
In [8]:
# predict = Xs1.dot(w)
predict = Xs1 @ w
RSE = np.sum((T-predict)**2) / np.sum((T - T.mean(0))**2)
RSE
Out[8]:
So, our linear model seems to be quite a bit better than using just the mean mpg.
Well, maybe our model would do better if it were a little closer to just using the mean, or, equivalently, just using the bias weight. This is the question that drives the derivation of "ridge regression". Let's add a term to our sum of squared error objective function that is the sum of the squared weight magnitudes, excluding the bias weight. Then we not only minimize the sum of squared errors, we also minimize the squared weight magnitudes:
$$ \sum_{i=1}^N (\tv_i - \xv_i^T \wv)^2 + \lambda \sum_{j=2}^D w_j^2$$Notice that $\lambda$ in there, and that the second sum runs over the $D$ weights but skips the bias weight $w_1$. With $\lambda=0$ we have our usual linear regression objective function. With $\lambda>0$, we are adding in a penalty on the weight magnitudes.
How does this change our solution for the best $\wv$? You work it out. You should get
$$ \wv = (X^T X + \lambda I)^{-1} X^T T $$except instead of using
$$ \lambda I = \begin{bmatrix} \lambda & 0 & \dotsc & 0\\ 0 & \lambda & \dotsc & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dotsc & \lambda \end{bmatrix} $$we want
$$ \begin{bmatrix} 0 & 0 & \dotsc & 0\\ 0 & \lambda & \dotsc & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dotsc & \lambda \end{bmatrix} $$so we don't penalize the bias weight, the weight multiplying the constant 1 input component.
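Here is a sketch of where that solution comes from, with $X$ holding the input samples as rows and $T$ holding the targets, as in the solution above. Write the penalized objective in matrix form, take its gradient with respect to $\wv$, and set it to zero:
$$\begin{aligned} E(\wv) &= (T - X\wv)^T (T - X\wv) + \lambda \wv^T \wv \\ \frac{\partial E}{\partial \wv} &= -2 X^T (T - X\wv) + 2 \lambda \wv = 0 \\ (X^T X + \lambda I)\, \wv &= X^T T \\ \wv &= (X^T X + \lambda I)^{-1} X^T T \end{aligned}$$This version penalizes all of the weights, including the bias. Replacing $\lambda I$ with the matrix above whose first diagonal entry is 0 removes the penalty on the bias weight and leaves the rest of the derivation unchanged.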
Now, what value should $\lambda$ be? We must determine it empirically, by calculating the sum of squared errors on test data.
Actually, we should not find the best value of $\lambda$ by comparing error on the test data. This will give us an overly optimistic prediction of error on novel data, because the test data was used to pick the best $\lambda$. We really must hold out another partition of data from the training set for this. This third partition is often called the model validation set. So, we partition our data into disjoint training, validation, and testing subsets, and use the validation subset to pick the best value of $\lambda$.
Let's do it!
In [7]:
lamb = 0.1
D = Xs1.shape[1]
lambdaDiag = np.eye(D) * lamb
lambdaDiag[0,0] = 0
lambdaDiag
Out[7]:
In [8]:
def makeLambda(D, lamb=0):
    lambdaDiag = np.eye(D) * lamb
    lambdaDiag[0, 0] = 0
    return lambdaDiag

makeLambda(3, 0.2)
Out[8]:
In [9]:
w = np.linalg.lstsq(Xs1.T @ Xs1 + lambdaDiag, Xs1.T @ T)[0]
w
Out[9]:
In [10]:
D = Xs1.shape[1]
w1 = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D,0.1), Xs1.T @ T)[0]
w2 = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D,0), Xs1.T @ T)[0]
np.hstack((w1,w2))
Out[10]:
In [11]:
for lamb in [0, 0.1, 1, 10, 100]:
    w = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D, lamb), Xs1.T @ T)[0]
    plt.plot(w)
In [12]:
lambdas = [0, 0.1, 1, 10, 100, 1000]
for lamb in lambdas:
    w = np.linalg.lstsq(Xs1.T @ Xs1 + makeLambda(D, lamb), Xs1.T @ T)[0]
    plt.plot(Xs1[:30] @ w)
plt.plot(T[:30], 'ro', lw=5, alpha=0.8)
plt.legend(lambdas, loc='best')
Out[12]:
Which $\lambda$ value is best? Careful. What would the best value of $\lambda$ be if we just compared error on the training data?
Now, careful again! We can't report the expected error from testing data that was also used to pick the best value of $\lambda$. That error is likely to be better than what you will get on new data that was not used to train the model and also was not used to pick the value of $\lambda$.
We need a way to partition our data into training, validation, and testing subsets. Let's write a function that makes these partitions randomly, given the data matrices and the fractions to be used in the three partitions.
In [9]:
def partition(X, T, trainFraction=0.8, validateFraction=0.1, testFraction=0.1):
    '''Usage: Xtrain,Ttrain,Xval,Tval,Xtest,Ttest = partition(X,T,0.6,0.2,0.2)'''
    if not np.isclose(trainFraction + validateFraction + testFraction, 1):
        raise ValueError("Train, validate and test fractions must sum to 1. Given values sum to " + str(trainFraction + validateFraction + testFraction))
    n = X.shape[0]
    nTrain = round(trainFraction * n)
    nValidate = round(validateFraction * n)
    nTest = round(testFraction * n)
    if nTrain + nValidate + nTest != n:
        nTest = n - nTrain - nValidate
    # Random order of data matrix row indices
    rowIndices = np.arange(X.shape[0])
    np.random.shuffle(rowIndices)
    # Build X and T matrices by selecting corresponding rows for each partition
    Xtrain = X[rowIndices[:nTrain], :]
    Ttrain = T[rowIndices[:nTrain], :]
    Xvalidate = X[rowIndices[nTrain:nTrain+nValidate], :]
    Tvalidate = T[rowIndices[nTrain:nTrain+nValidate], :]
    Xtest = X[rowIndices[nTrain+nValidate:nTrain+nValidate+nTest], :]
    Ttest = T[rowIndices[nTrain+nValidate:nTrain+nValidate+nTest], :]
    return Xtrain, Ttrain, Xvalidate, Tvalidate, Xtest, Ttest
In [10]:
X = np.arange(20).reshape((10,2))
X
Out[10]:
In [11]:
T = np.arange(10).reshape((-1,1))
T
Out[11]:
In [12]:
X = np.arange(20).reshape((10,2))
T = np.arange(10).reshape((-1,1))
Xtrain,Ttrain,Xval,Tval,Xtest,Ttest = partition(X,T,0.6,0.2,0.2)
In [17]:
print("Xtrain:")
print(Xtrain)
print(" Ttrain:")
print(Ttrain)
print("\n Xval:")
print(Xval)
print(" Tval:")
print(Tval)
print("\n Xtest:")
print(Xtest)
print(" Ttest:")
print(Ttest)
Now we can train our model using several values of $\lambda$ on Xtrain,Ttrain and calculate the model error on Xval,Tval. Then we pick the best value of $\lambda$ based on the error on Xval,Tval. Finally, we calculate the error of the model using the best $\lambda$ on Xtest,Ttest as our estimate of error on new data.
However, these errors will be affected by the random partitioning of the data. Repeating with new partitions may result in a different $\lambda$ being best. We should repeat the above procedure many times and average over the results. How many repetitions do we need?
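Here is a minimal sketch of that repetition, not part of the analysis below. It assumes X and T hold the MPG data (re-created here in case they were overwritten), uses the partition and makeLambda functions defined above, and picks an arbitrary single $\lambda$ and an arbitrary 20 repetitions just for illustration.

X, T, Xnames, Tname = makeMPGData()   # make sure X and T hold the MPG data
lamb = 1.0          # arbitrary single value, just to illustrate the repetition
nRepetitions = 20   # arbitrary; more repetitions give a more stable average
valRMSEs = []
for rep in range(nRepetitions):
    Xtrain, Ttrain, Xval, Tval, Xtest, Ttest = partition(X, T, 0.6, 0.2, 0.2)
    means, stds = Xtrain.mean(0), Xtrain.std(0)
    Xtrain1 = np.insert((Xtrain - means) / stds, 0, 1, axis=1)
    Xval1 = np.insert((Xval - means) / stds, 0, 1, axis=1)
    w = np.linalg.lstsq(Xtrain1.T @ Xtrain1 + makeLambda(Xtrain1.shape[1], lamb),
                        Xtrain1.T @ Ttrain)[0]
    valRMSEs.append(np.sqrt(np.mean((Tval - Xval1 @ w)**2)))
print(np.mean(valRMSEs), np.std(valRMSEs))   # average and spread of validation RMSE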
Another approach, commonly followed in machine learning, is to first partition the data into $k$ subsets, called "folds". Pick one fold to be the test partition, another fold to be the validate partition, and collect the remaining folds to be the train partition. We can do this $k\,(k-1)$ ways. (Why?) So, with $k=5$ we get 20 repetitions, performed in a way that distributes every sample among the train, validate, and test partitions.
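To see where the $k\,(k-1)$ comes from, here is a tiny counting sketch (a toy check, not needed for anything below): each of the $k$ choices of test fold can be paired with any of the remaining $k-1$ folds as the validate fold.

k = 5
# All (validateFold, testFold) pairs with the two folds different
pairs = [(v, t) for t in range(k) for v in range(k) if v != t]
print(len(pairs))   # 20 = k * (k - 1)
print(pairs[:6])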
First, a little note on the yield statement in Python. The yield statement is like return, except that execution pauses at that point in the function after returning the values. When the next value is requested, execution continues from that paused point.
In [13]:
def count(n):
    for a in range(n):
        yield a
In [14]:
count(4)
Out[14]:
In [15]:
list(count(4))
Out[15]:
In [16]:
for i in count(5):
    print(i)
In [17]:
zip?
In [19]:
def partitionKFolds(X, T, nFolds, shuffle=False, nPartitions=3):
    '''Usage: for Xtrain,Ttrain,Xval,Tval,Xtest,Ttest in partitionKFolds(X,T,5):'''
    # Randomly arrange row indices
    rowIndices = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(rowIndices)
    # Calculate number of samples in each of the nFolds folds
    nSamples = X.shape[0]
    nEach = int(nSamples / nFolds)
    if nEach == 0:
        raise ValueError("partitionKFolds: Number of samples in each fold is 0.")
    # Calculate the starting and stopping row index for each fold.
    # Store in startsStops as list of (start,stop) pairs
    starts = np.arange(0, nEach * nFolds, nEach)
    stops = starts + nEach
    stops[-1] = nSamples
    startsStops = list(zip(starts, stops))
    # Repeat with testFold taking each single fold, one at a time
    for testFold in range(nFolds):
        if nPartitions == 3:
            # Repeat with validateFold taking each single fold, except for the testFold
            for validateFold in range(nFolds):
                if testFold == validateFold:
                    continue
                # trainFolds are all remaining folds, after selecting test and validate folds
                trainFolds = np.setdiff1d(range(nFolds), [testFold, validateFold])
                # Construct Xtrain and Ttrain by collecting rows for all trainFolds
                rows = []
                for tf in trainFolds:
                    a, b = startsStops[tf]
                    rows += rowIndices[a:b].tolist()
                Xtrain = X[rows, :]
                Ttrain = T[rows, :]
                # Construct Xvalidate and Tvalidate
                a, b = startsStops[validateFold]
                rows = rowIndices[a:b]
                Xvalidate = X[rows, :]
                Tvalidate = T[rows, :]
                # Construct Xtest and Ttest
                a, b = startsStops[testFold]
                rows = rowIndices[a:b]
                Xtest = X[rows, :]
                Ttest = T[rows, :]
                # Return partition matrices, then suspend until called again.
                yield Xtrain, Ttrain, Xvalidate, Tvalidate, Xtest, Ttest, testFold
        else:
            # trainFolds are all remaining folds, after selecting the test fold
            trainFolds = np.setdiff1d(range(nFolds), [testFold])
            # Construct Xtrain and Ttrain by collecting rows for all trainFolds
            rows = []
            for tf in trainFolds:
                a, b = startsStops[tf]
                rows += rowIndices[a:b].tolist()
            Xtrain = X[rows, :]
            Ttrain = T[rows, :]
            # Construct Xtest and Ttest
            a, b = startsStops[testFold]
            rows = rowIndices[a:b]
            Xtest = X[rows, :]
            Ttest = T[rows, :]
            # Return partition matrices, then suspend until called again.
            yield Xtrain, Ttrain, Xtest, Ttest, testFold
In [20]:
X = np.arange(20).reshape((10,2))
T = np.arange(10).reshape((-1,1))
k = 0
for Xtrain,Ttrain,Xval,Tval,Xtest,Ttest,testFold in partitionKFolds(X,T,5):
    k += 1
    print("Fold", k)
    print(" Xtrain:")
    print(Xtrain)
    print(" Ttrain:")
    print(Ttrain)
    print("\n Xval:")
    print(Xval)
    print(" Tval:")
    print(Tval)
    print("\n Xtest:")
    print(Xtest)
    print(" Ttest:")
    print(Ttest)
Typical use of these partitions is shown below. It is most handy to just collect all results in a matrix and calculate averages afterwards, rather than accumulating each result and dividing by the number of repetitions at the end.
In [21]:
def train(X, T, lamb):
    means = X.mean(0)
    stds = X.std(0)
    n, d = X.shape
    Xs1 = np.insert((X - means) / stds, 0, 1, axis=1)
    lambDiag = np.eye(d + 1) * lamb
    lambDiag[0, 0] = 0
    w = np.linalg.lstsq(Xs1.T @ Xs1 + lambDiag, Xs1.T @ T)[0]
    return {'w': w, 'means': means, 'stds': stds}

def use(X, model):
    Xs1 = np.insert((X - model['means']) / model['stds'], 0, 1, axis=1)
    return Xs1 @ model['w']

def rmse(A, B):
    return np.sqrt(np.mean((A - B)**2))
In [29]:
lambdas = [0, 1, 5, 10, 20]
X, T, Xnames, Tname = makeMPGData()   # use the MPG data; X and T were overwritten by the toy example above
results = []
for Xtrain,Ttrain,Xval,Tval,Xtest,Ttest,_ in partitionKFolds(X,T,5):
    for lamb in lambdas:
        model = train(Xtrain, Ttrain, lamb)
        results.append([lamb,
                        rmse(use(Xtrain, model), Ttrain),
                        rmse(use(Xval, model), Tval),
                        rmse(use(Xtest, model), Ttest)])
results = np.array(results)
print(results)
print(results.shape)
avgresults = []
for lam in lambdas:
    print(lam)
    print(results[results[:, 0] == lam, 1:])
    avgresults.append([lam] + np.mean(results[results[:, 0] == lam, 1:], axis=0).tolist())
avgresults = np.array(avgresults)
print(avgresults)
In [28]:
plt.plot(avgresults[:,0],avgresults[:,1:],'o-')
plt.xlabel(r'$\lambda$')
plt.ylabel('RMSE')
plt.legend(('Train','Validate','Test'),loc='best');